nycflights13library(tidyverse)
Loading tidyverse: ggplot2
Loading tidyverse: tibble
Loading tidyverse: tidyr
Loading tidyverse: readr
Loading tidyverse: purrr
Loading tidyverse: dplyr
Conflicts with tidy packages -------------------
filter(): dplyr, stats
lag(): dplyr, stats
library(nycflights13)
You would need origin and dest for each flight, along with the lat and long for each origin and destination airport. This means that you would need to combine the flights and airports tables.
weather and airports. What is the relationship and how should it appear in the diagram?origin in weather connects to faa in airports.
weather only contains information for the origin (NYC) airports. If it contained weather records for all airports in the USA, what additional relation would it define with flights?It would connect with dest in flightsso that weather records for the destination airport could be identified as well.
You could have a specialdays table with a dayname as its primary key and then connect to flights via year, month and day variables.
flights.flights <- flights %>%
mutate(flight_id = row_number())
flights %>%
count(flight_id) %>%
filter(n > 1)
Identify the keys in the following datasets
Lahman::Batting,babynames::babynamesnasaweather::atmosfueleconomy::vehiclesggplot2::diamonds
(You might need to install some packages and read some documentation.)
library(Lahman)
Batting
Batting %>%
count(playerID, yearID, stint) %>%
filter(n > 1)
library(babynames)
babynames
babynames %>%
count(year, sex, name) %>%
filter(nn > 1)
library(nasaweather)
atmos
atmos %>%
count(lat, long, year, month) %>%
filter(n > 1)
library(fueleconomy)
vehicles
vehicles %>%
count(id) %>%
filter(n > 1)
diamonds
This doesn’t appear to have a primary key because each combination of diamond attributes wouldn’t be guaranteed to be unique (even if they are unique within the current data set). It would be best to create a surrogate key.
diamonds <- diamonds %>%
mutate(id = row_number())
Batting, Master, and Salaries tables in the Lahman package. Draw another diagram that shows the relationship between Master, Managers, AwardsManagers.Batting$playerID connects to Master$playerID. Salaries$playerID connects to Master$playerID. Batting and Salaries could be connected via playerID, yearID, and teamID.
Master$playerID connects to Managers$playerID. Master$playerID connects to AwardsManagers$playerID. Managers and AwardsManagers via playerID and yearID.
How would you characterise the relationship between the Batting, Pitching, and Fielding tables?
Batting
Pitching
Fielding
They should have a one-to-one relationship with matching primary keys (playerID, yearID, stint) across the three tables - with most position players only having records for Batting and Fielding, while pitchers are likely to have records only in Pitching and Fielding.
airports data frame so you can show the spatial distribution of delays. Here’s an easy way to draw a map of the United States:airports %>%
semi_join(flights, c("faa" = "dest")) %>%
ggplot(aes(lon, lat)) +
borders("state") +
geom_point() +
coord_quickmap()
Attaching package: ‘maps’
The following object is masked from ‘package:purrr’:
map
(Don’t worry if you don’t understand what semi_join() does — you’ll learn about it next.)
You might want to use the size or colour of the points to display the average delay for each airport.
flights %>%
group_by(dest) %>%
summarise(avg_delay = mean(arr_delay, na.rm = TRUE)) %>%
left_join(airports, c("dest" = "faa")) %>%
ggplot(aes(lon, lat)) +
borders("state") +
geom_point(aes(colour = avg_delay)) +
coord_quickmap()
lat and lon) to flights.flights %>%
left_join(airports, by = c("dest" = "faa")) %>%
left_join(airports, by = c("origin" = "faa"))
flights %>%
group_by(tailnum) %>%
summarise(avg_delay = mean(arr_delay, na.rm = TRUE)) %>%
right_join(planes) %>%
ggplot(aes(year, avg_delay)) +
geom_point()
Joining, by = "tailnum"
There doesn’t appear to be a clear relationship.
flights %>%
group_by(year, month, day, hour) %>%
summarise(avg_delay = mean(dep_delay, na.rm = TRUE)) %>%
left_join(weather) %>%
ggplot(aes(temp, avg_delay)) +
geom_point(alpha = 0.1)
Joining, by = c("year", "month", "day", "hour")
flights %>%
group_by(year, month, day, hour) %>%
summarise(avg_delay = mean(dep_delay, na.rm = TRUE)) %>%
left_join(weather) %>%
ggplot(aes(wind_speed, avg_delay)) +
geom_point(alpha = 0.1)
Joining, by = c("year", "month", "day", "hour")
flights %>%
group_by(year, month, day, hour) %>%
summarise(avg_delay = mean(dep_delay, na.rm = TRUE)) %>%
left_join(weather) %>%
ggplot(aes(precip, avg_delay)) +
geom_point(alpha = 0.1)
Joining, by = c("year", "month", "day", "hour")
flights %>%
group_by(year, month, day, hour) %>%
summarise(avg_delay = mean(dep_delay, na.rm = TRUE)) %>%
left_join(weather) %>%
ggplot(aes(pressure, avg_delay)) +
geom_point(alpha = 0.1)
Joining, by = c("year", "month", "day", "hour")
flights %>%
group_by(year, month, day, hour) %>%
summarise(avg_delay = mean(dep_delay, na.rm = TRUE)) %>%
left_join(weather) %>%
ggplot(aes(visib, avg_delay)) +
geom_point(alpha = 0.1)
Joining, by = c("year", "month", "day", "hour")
flights %>%
filter(year == 2013, month == 6, day == 13) %>%
group_by(dest) %>%
summarise(avg_delay = mean(arr_delay, na.rm = TRUE)) %>%
left_join(airports, c("dest" = "faa")) %>%
ggplot(aes(lon, lat)) +
borders("state") +
geom_point(aes(colour = avg_delay)) +
coord_quickmap()
There were severe storms in the mid-Atlantic region.
tailnum? What do the tail numbers that don’t have a matching record in planes have in common? (Hint: one variable explains ~90% of the problems.)flights %>%
anti_join(planes, by = "tailnum") %>%
count(carrier)
?planes
A missing tailnum means that the metadata for that plane hasn’t been recorded. The vast majority of these cases come from two carriers (AA and MQ). Looking at the help file for planes explains that these two carriers report fleet numbers rather than tail numbers.
flights_100 <- flights %>%
count(tailnum) %>%
filter(n >= 100)
semi_join(flights, flights_100)
Joining, by = "tailnum"
fueleconomy::vehicles and fueleconomy::common to find only the records for the most common models.semi_join(vehicles, common)
Joining, by = c("make", "model")
worst_delays <- flights %>%
group_by(year, month, day, hour) %>%
summarise(avg_delay = mean(dep_delay, na.rm = TRUE)) %>%
arrange(desc(avg_delay)) %>%
head(48)
semi_join(weather, worst_delays) %>%
ggplot(aes(temp)) +
geom_histogram()
Joining, by = c("year", "month", "day", "hour")
It’s possible to plot histograms for the different weather variables in a search for patterns. The temperature example above suggests that sometimes the delays happened on quite cold days but more of them happened in mild or warm weather.
anti_join(flights, airports, by = c("dest" = "faa")) tell you? What does anti_join(airports, flights, by = c("faa" = "dest")) tell you?anti_join(flights, airports, by = c("dest" = "faa"))
This shows flights that went to a destination not listed in the airports database - from a quick inspection and search, many of these seem to be to airports in Puerto Rico or elsewhere in the Caribbean, ie outside the mainland United States.
anti_join(airports, flights, by = c("faa" = "dest"))
This shows airports that are listed in the airports database but which weren’t the destination for any flights in the flights database.
# Identify any planes (by `tailnum` that didn't have only a single carrier recorded for their flight records)
multi_carrier <- flights %>%
count(tailnum, carrier) %>%
count(tailnum) %>%
filter(nn != 1)
semi_join(flights, multi_carrier) %>%
count(tailnum, carrier)
Joining, by = "tailnum"